Lessons Learned from Large Scale Evaluation of Systems that Produce Text: Nightmares and Pleasant Surprises
Abstract
As the language generation community explores the possibility of an evaluation program for language generation, it behooves us to examine our experience in evaluating other systems that produce text as output. Large scale evaluation of summarization systems and of question answering systems has been carried out for several years now. Summarization and question answering systems produce text output given text as input, while language generation produces text from a semantic representation. Given that the output has the same properties, we can learn from the mistakes and the understandings gained in earlier evaluations. In this invited talk, I will discuss what we have learned from the large scale summarization evaluations carried out in the Document Understanding Conferences (DUC) from 2001 to the present, from the large scale question answering evaluations carried out in TREC (e.g., the definition pilot), and from the new large scale evaluations being carried out in the DARPA GALE (Global Autonomous Language Exploitation) program.

DUC was developed and is run by NIST and provides a forum for regular evaluation of summarization systems. NIST oversees the gathering of data, including both input documents and gold standard summaries, some of which is done by NIST and some by the LDC. Each year, some 30 to 50 document sets were gathered as test data, and between two and nine summaries were written for each of the input sets. NIST has carried out both manual and automatic evaluation by comparing system output against the gold standard summaries written by humans. The results are made public at the annual conference. In the most recent years, the number of participants has grown to 25 or 30 sites from all over the world.

TREC is also run by NIST and provides an annual opportunity for evaluating the output of question answering (QA) systems. Of the various QA evaluations, the one that is probably most illuminating for language generation is the definition pilot. In this evaluation, systems generated long answers (e.g., paragraph-length responses or lists of facts) in response to a request for a definition. In contrast to DUC, no model answers were developed. Instead, system output was pooled, and human judges determined which facts within the output were necessary (termed “vital nuggets”) and which were helpful but not absolutely necessary (termed “OK nuggets”). Systems could then be scored on their recall of nuggets and the precision of their response (a simplified sketch of this nugget-based scoring follows the abstract).

DARPA GALE is a new DARPA-funded program that is running its own evaluation, carried out by BAE Systems, an independent contractor. The evaluation more closely resembles that done in TREC, but the systems’ scores will be compared against the scores of human distillers who carry out the same task; thus, final numbers will report percent of human performance. In the GALE evaluation, which is a future event at the time of this writing, in addition to measuring properties such as precision and recall, BAE will also measure systems’ ability to find all occurrences of the same fact in the input (redundancy).

One consideration for an evaluation program is the feel of the program: does it motivate researchers, or does it cause headaches? I liken Columbia’s experience in DUC, and currently in GALE, to that of Max in Where the Wild Things Are by Maurice Sendak. We began with punishment (i.e., if you don’t do well, your funding will be in jeopardy), encounter monsters along the way (seemingly arbitrary methods for …
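The sketch below illustrates the nugget-based scoring idea described above for the TREC definition pilot: recall is computed over the vital nuggets a system returns, and precision reflects how much of the response is judged relevant. The function name, the example data, and the count-based precision are illustrative assumptions for this sketch; the official TREC metric instead uses a length-based precision allowance.

```python
# Simplified, illustrative sketch of nugget-based scoring for a
# definition-style answer. Not the official TREC formula: real scoring
# uses a length allowance for precision rather than the count-based
# ratio shown here.

def nugget_scores(returned, vital, ok):
    """Score one system response.

    returned -- set of nugget ids an assessor matched in the system output
    vital    -- set of nugget ids judged necessary ("vital nuggets")
    ok       -- set of nugget ids judged helpful but not necessary ("OK nuggets")
    """
    # Recall is computed over vital nuggets only:
    # how many of the necessary facts did the system return?
    recall = len(returned & vital) / len(vital) if vital else 0.0

    # Count-based stand-in for precision: what fraction of the matched
    # nuggets in the response were judged vital or OK?
    relevant = vital | ok
    precision = len(returned & relevant) / len(returned) if returned else 0.0
    return recall, precision


if __name__ == "__main__":
    vital = {"n1", "n2", "n3"}           # facts judged necessary
    ok = {"n4", "n5"}                    # facts judged helpful but optional
    returned = {"n1", "n3", "n4", "n9"}  # nuggets matched in one system's answer

    r, p = nugget_scores(returned, vital, ok)
    print(f"nugget recall = {r:.2f}, precision = {p:.2f}")
```

In this hypothetical example the system recovers two of the three vital nuggets (recall 0.67), and three of its four matched facts are judged relevant (precision 0.75); pooling system outputs in this way lets assessors score long, free-form answers without writing model answers in advance.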